Telecom Churn Case Study

|

| Mobile: +91 9491392912 | | vamshi.krishna.prime@gmail.com |

|


Table of contents:

  1. Case Study Overview
  2. Explore Data

1. Case Study Overview

================================

In the telecommunication industry, customers tend to switch operators when they are not offered attractive schemes and offers. It is therefore very important for any telecom operator to retain its existing customers. As a data scientist, the task in this case study is to build an ML model that predicts, from past data, whether a customer will churn in a particular month.

2. Data Exploration

===========================

[2](#2.-Explore-Data).1 Load Libraries:

Load all relevant libraries in this section to ease maintenance and tracking.

[2](#2.-Explore-Data).2 Load Data:

Import data dictionary to understand the variables:

Import Train dataset:

[2](#2.-Explore-Data).3 Understand Variables:

Visually inspect the `data dictionary`

Visually inspect the `train` data

[2](#2.-Explore-Data).4 Understand Dataset:

Filter the variables with a null count greater than zero, sort them in descending order, and highlight the null count in red.
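The filtering step above can be sketched roughly as follows. This is a minimal illustration on a made-up toy frame, not the notebook's original cell; the helper name `null_summary` and the `arpu_6` column are assumptions introduced here.

```python
import numpy as np
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the columns that contain nulls, sorted by null count (descending)."""
    counts = df.isnull().sum()
    counts = counts[counts > 0].sort_values(ascending=False)
    return pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })


# Toy frame standing in for the train data (all values are made up).
toy = pd.DataFrame({
    "arpu_6": [100.0, 250.5, 80.0, 40.0],
    "total_rech_data_6": [np.nan, 2.0, np.nan, np.nan],
    "date_of_last_rech_data_6": [None, "6/21/2014", None, "6/30/2014"],
})
summary = null_summary(toy)
print(summary)

# In a notebook, the count column can be rendered in red (requires jinja2):
# summary.style.set_properties(subset=["null_count"], **{"color": "red"})
```

Columns without nulls (here `arpu_6`) are left out of the summary, so the table only lists the variables that need attention.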

However, columns containing nulls are not dropped from the dataset at this stage, since the missing values may reflect a pattern over time. Impute with 0 where appropriate.


Exploratory Data Analysis

The above plots show that the distribution of the data is right-skewed: a very small proportion of users generated high average revenue, while most users generated low average revenue.

The above plots show that the total recharge data (`total_rech_data`) for months 6, 7 and 8 is right-skewed, and that the maximum recharge data (`max_rech_data`) has outliers in every month.
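The direction of skew read off a plot can also be checked numerically. The sketch below uses made-up revenue values, not the case-study data: a positive skewness indicates a long right tail (a few large values), a negative one a long left tail.

```python
import pandas as pd

# Made-up revenue values: most users are low, a few are very high.
revenue = pd.Series([10, 12, 15, 11, 14, 13, 12, 16, 90, 400], dtype=float)

skewness = revenue.skew()  # sample skewness; > 0 means a long right tail
print(f"skewness = {skewness:.2f}")
print("mean > median:", revenue.mean() > revenue.median())
```

When a few large values pull the mean above the median, as here, the distribution has a long right tail.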

Data Wrangling:

Take a backup of the data prior to data manipulation:

Handling Missing Values

Check for standard deviation to understand the variance:

Imputing missing values

Explore the `total_rech_data_6` and `date_of_last_rech_data_6` columns:

If a customer does not recharge in a given month, the total_rech_data and date_of_last_rech_data values for that month remain empty.
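One way to act on this observation is to set the recharge count to 0 only where the matching recharge date is also missing. The following is a sketch under that assumption; the helper name `impute_no_recharge` is hypothetical and the toy values are made up.

```python
import numpy as np
import pandas as pd


def impute_no_recharge(df: pd.DataFrame, months=(6, 7, 8)) -> pd.DataFrame:
    """Set total_rech_data_<m> to 0 where both the count and the last-recharge date are missing."""
    out = df.copy()
    for m in months:
        count_col = f"total_rech_data_{m}"
        date_col = f"date_of_last_rech_data_{m}"
        # Both empty: treated here as "no data recharge happened in that month".
        no_recharge = out[count_col].isnull() & out[date_col].isnull()
        out.loc[no_recharge, count_col] = 0
    return out


toy = pd.DataFrame({
    "total_rech_data_6": [np.nan, 2.0, np.nan],
    "date_of_last_rech_data_6": [None, "6/21/2014", "6/30/2014"],
})
imputed = impute_no_recharge(toy, months=(6,))
print(imputed)
```

The third row keeps its missing count because a recharge date exists for it, so that gap is a genuine data issue rather than an absent recharge.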

Validate the data imputation:

Imputing missing values in categorical variables:

Drop variables based on threshold of missing values:
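A minimal sketch of threshold-based dropping is shown below. The 70% cut-off and the helper name `drop_sparse_columns` are assumptions made for illustration; the source does not state which threshold was used.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.7):
    """Drop columns whose fraction of missing values exceeds `threshold`.

    Returns the reduced frame and the list of dropped column names.
    """
    missing_frac = df.isnull().mean()
    to_drop = missing_frac[missing_frac > threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop


toy = pd.DataFrame({
    "mostly_missing": [1.0, np.nan, np.nan, np.nan],   # 75% missing
    "partly_missing": [1.0, 2.0, np.nan, 4.0],         # 25% missing
    "complete": [1, 2, 3, 4],
})
reduced, dropped = drop_sparse_columns(toy, threshold=0.7)
print(f"dropped {len(dropped)} column(s): {dropped}")
```

Returning the dropped names alongside the frame makes it easy to report how many variables left the analysis and to repeat the step with a different threshold.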

Check for the count of variables dropped from the analysis:

Iterate the process:


Data Transformation:

Categorize the customers as high-value or generic based on the revenue they generate. The calculation is based on the past data, which covers the months of June and July.

Formula to calculate total data recharge amount = number of recharges * avg recharge amount

To find the total amount spent by the customer for recharge for a particular month = total data recharge + total recharge
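The two formulas above can be sketched in pandas as follows. The column names `av_rech_amt_data_<m>` and `total_rech_amt_<m>`, the derived names `total_data_rech_amt_<m>` and `total_amt_<m>`, and the helper itself are assumptions for illustration; only `total_rech_data_<m>` appears in the text.

```python
import pandas as pd


def add_recharge_totals(df: pd.DataFrame, months=(6, 7, 8)) -> pd.DataFrame:
    """Add per-month data recharge amount and total amount spent on recharges."""
    out = df.copy()
    for m in months:
        # total data recharge amount = number of recharges * avg recharge amount
        out[f"total_data_rech_amt_{m}"] = (
            out[f"total_rech_data_{m}"] * out[f"av_rech_amt_data_{m}"]
        )
        # total amount = total data recharge + total recharge
        out[f"total_amt_{m}"] = (
            out[f"total_data_rech_amt_{m}"] + out[f"total_rech_amt_{m}"]
        )
    return out


toy = pd.DataFrame({
    "total_rech_data_6": [2.0, 0.0],
    "av_rech_amt_data_6": [150.0, 0.0],
    "total_rech_amt_6": [300.0, 100.0],
})
totals = add_recharge_totals(toy, months=(6,))
print(totals[["total_data_rech_amt_6", "total_amt_6"]])
```

This assumes missing recharge counts and amounts were already imputed with 0; otherwise the products propagate NaN.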

Here, high-value customers are those whose spend is at or above the 70th percentile; the 80th percentile is another possible choice.

21013 rows remain after selecting the customers whose recharge value is greater than or equal to that of the 70th-percentile customer.
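The percentile filter can be sketched as below, assuming hypothetical `total_amt_6` and `total_amt_7` columns that hold the total amount spent in June and July. The toy values are made up and do not reproduce the 21013-row result.

```python
import pandas as pd


def select_high_value(df: pd.DataFrame, pct: float = 0.70):
    """Keep customers whose average June/July spend is >= the `pct` quantile."""
    avg_amt = df[["total_amt_6", "total_amt_7"]].mean(axis=1)
    cutoff = avg_amt.quantile(pct)
    return df[avg_amt >= cutoff].copy(), cutoff


# Ten made-up customers spending 100, 200, ..., 1000 in both months.
amounts = [100.0 * i for i in range(1, 11)]
toy = pd.DataFrame({"total_amt_6": amounts, "total_amt_7": amounts})

high_value, cutoff = select_high_value(toy, pct=0.70)
print(f"cutoff = {cutoff}, rows kept = {len(high_value)}")
```

Passing `pct=0.80` instead gives the stricter alternative mentioned in the text.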

Calculate the difference between the 8th month and the average of the 6th and 7th months
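A rough sketch of this derived feature follows. The helper `add_month8_delta`, the `<base>_diff` naming, and the `arpu` example column are assumptions for illustration.

```python
import pandas as pd


def add_month8_delta(df: pd.DataFrame, base_cols) -> pd.DataFrame:
    """For each base name, add <base>_diff = month 8 value - mean(month 6, month 7)."""
    out = df.copy()
    for base in base_cols:
        prior_avg = (out[f"{base}_6"] + out[f"{base}_7"]) / 2
        out[f"{base}_diff"] = out[f"{base}_8"] - prior_avg
    return out


toy = pd.DataFrame({
    "arpu_6": [100.0, 200.0],
    "arpu_7": [300.0, 200.0],
    "arpu_8": [50.0, 250.0],
})
with_delta = add_month8_delta(toy, base_cols=["arpu"])
print(with_delta["arpu_diff"].tolist())
```

A negative difference means the customer's month-8 value fell below their June/July average, which is the kind of drop a churn model can pick up on.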


Univariate Analysis:

Bivariate Analysis:

Cap outliers in all numeric variables with k-sigma technique
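A minimal sketch of k-sigma capping is given below: values beyond mean ± k standard deviations are clipped to those bounds. The choice k = 3 and the helper name `cap_outliers` are assumptions; the source does not state the value of k.

```python
import pandas as pd


def cap_outliers(df: pd.DataFrame, k: float = 3.0) -> pd.DataFrame:
    """Clip every numeric column to [mean - k*std, mean + k*std]."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        mean, std = out[col].mean(), out[col].std()
        out[col] = out[col].clip(lower=mean - k * std, upper=mean + k * std)
    return out


# Twenty ordinary values and one extreme value (all made up).
toy = pd.DataFrame({
    "usage": [10.0] * 20 + [1000.0],
    "circle": ["A"] * 21,  # non-numeric column, left untouched
})
capped = cap_outliers(toy, k=3.0)
print(capped["usage"].max())
```

Because the bounds are computed from the uncapped data, a single extreme value inflates the standard deviation and is only pulled in partially; that is a known limitation of this technique.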

Multivariate Analysis:

Let's plot the correlations on a heatmap for better visualisation:


Model Building and Prediction:

Aggregating the categorical columns:


PCA - Principal component analysis:


PCA and Logistic Regression:
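As an illustration of this step, the sketch below chains scaling, PCA and logistic regression in a scikit-learn `Pipeline`. It runs on synthetic data from `make_classification`, not the churn dataset, and the specific settings (95% explained variance, balanced class weights, ROC AUC as the metric) are assumptions rather than the notebook's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (about 10% positives).
X, y = make_classification(
    n_samples=1000, n_features=30, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

pca_logit = Pipeline([
    ("scale", StandardScaler()),                       # PCA is scale-sensitive
    ("pca", PCA(n_components=0.95, svd_solver="full")),  # keep 95% of the variance
    ("logit", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pca_logit.fit(X_train, y_train)

proba = pca_logit.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(f"components kept: {pca_logit.named_steps['pca'].n_components_}")
print(f"test ROC AUC: {auc:.3f}")
```

Wrapping the steps in a pipeline ensures the scaler and PCA are fitted on the training split only, which avoids leaking test information into the model.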


Evaluate on test data:

Run predictions on test.csv, the Kaggle test file:


Hyperparameter tuning - PCA and Logistic Regression:
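One common way to tune such a pipeline is a cross-validated grid search over the number of components and the regularization strength. The sketch below uses synthetic data, and the grid values, fold count and scoring metric are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=42,
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("logit", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Step names prefix the parameter names: <step>__<parameter>.
param_grid = {
    "pca__n_components": [5, 10, 15],
    "logit__C": [0.1, 1.0, 10.0],
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print("best params:", search.best_params_)
print(f"best CV ROC AUC: {search.best_score_:.3f}")
```

Stratified folds keep the churn/non-churn ratio similar in every split, which matters when the positive class is rare.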


Random Forest model:

Run predictions on test.csv, the Kaggle test file, with the Random Forest model:

Exporting the CSV file for Kaggle submission:


Choosing best features:


Feature Importance:

Extracting top 30 features:
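A sketch of how the top 30 features might be pulled from a fitted random forest is shown below. It uses synthetic data with placeholder feature names, and the forest settings are assumptions, not the notebook's tuned values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 40 placeholder feature names.
X_arr, y = make_classification(
    n_samples=500, n_features=40, n_informative=6, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(40)])

rf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
rf.fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

top_features = importances.head(30)
print(top_features.head(10))
```

Impurity-based importances tend to favor high-cardinality and correlated features, so the ranking is best read as a guide rather than a definitive ordering.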

Top Features:


Extraction of the intercept and the coefficients from the logistic model:
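The intercept and coefficients of a fitted scikit-learn logistic regression are available as `intercept_` and `coef_`. The sketch below shows this on synthetic data with placeholder feature names; it fits on scaled features directly, since coefficients learned after PCA would refer to principal components rather than to the original variables.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with placeholder feature names.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=1
)
cols = [f"feature_{i}" for i in range(X.shape[1])]
X_scaled = StandardScaler().fit_transform(X)

logit = LogisticRegression(max_iter=1000).fit(X_scaled, y)

intercept = float(logit.intercept_[0])
coefs = pd.Series(logit.coef_[0], index=cols)
# Order by absolute size so the strongest effects come first.
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(f"intercept: {intercept:.3f}")
print(coefs)
```

Because the features were standardized, the coefficient magnitudes are comparable: a positive coefficient raises the predicted log-odds of the positive class and a negative one lowers it.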


Business Insights:


Credentials:

===================

Please find the contributors' details below:

Primary Author:


|

End of Case Study

|